Overview

Description

We decided to work on one of today's most pressing time series problems: predicting web traffic. We believe this kind of forecasting can help website operators a great deal in handling outages effectively. The techniques we implement here extend to diverse applications in financial markets, weather forecasting, and audio and video processing. Not just that: understanding your website's traffic trajectory can open up business opportunities too!

Dataset

The dataset consists of roughly 145k time series representing the number of daily page views of different Wikipedia articles, starting from July 1st, 2015 up until September 10th, 2017 (804 data points per series). The goal is to forecast the daily views between September 13th, 2017 and November 13th, 2017 (64 data points) for each article in the dataset. In this notebook we work through classical approaches: naive baselines, ARIMA-based techniques, exponential smoothing, and Prophet.

Let's import the required libraries

# visualization
library('ggplot2') 

# data manipulation
library('dplyr')
library('readr')
library('imputeTS')


#time series
library('fpp')
library('forecast')
library('xts')
library('zoo')
library('prophet')

Data Preparation

A dataset like this will contain many interventions, because web traffic depends on external factors: views can spike on one day and return to the same level shortly after. So let's start by searching for a series that looks reasonably stationary, where we can apply our ARIMA methods, and work on interventions later.

# importing the dataset
train_1 <- read.csv("~/Desktop/UChicago/Quarters/03-Quarters/Data/TS/web-traffic-time-series-forecasting/train_1.csv", header = TRUE, row.names = 1,sep = ",",skip =0)

Let’s have a look at different time series available

# extracting the series for the Google article
Google <- data.matrix(train_1[c("Google_zh.wikipedia.org_all-access_spider"),])
dimnames(Google)<-NULL
Google<-array(Google)
time_index <- seq(from = as.POSIXct("2015-07-01"), to = as.POSIXct("2016-12-31"), by = "day")
Google_ts <- xts(Google, order.by =time_index ,frequency = 365.25)
autoplot(Google_ts)

We can see that the data appears more or less stationary, but there is a spike in December 2015. This is a very common pattern in web traffic data, since it's heavily influenced by external factors. These spikes are called interventions, and we have a separate notebook on how to deal with them. In this notebook we deal only with data that appears more or less stationary. Let's have a look at other cases.
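As an aside, a crude way to flag such spikes programmatically is to compare each point against a rolling median. This is only a sketch in base R, not part of the original analysis; the window width `k` and threshold `thresh` are arbitrary choices.

```r
# Flag candidate interventions: points far from a rolling median.
# k (window width) and thresh are arbitrary tuning choices.
detect_spikes <- function(x, k = 7, thresh = 5) {
  med <- runmed(x, k)               # rolling median over k points
  resid <- x - med                  # deviation from the local level
  s <- mad(resid)                   # robust scale of the deviations
  which(abs(resid) > thresh * s)    # indices of candidate spikes
}

set.seed(42)
y <- rpois(200, lambda = 20)        # stand-in for daily page views
y[150] <- 120                       # inject one December-style spike
detect_spikes(y)                    # index 150 should be among the flagged points
```

A rolling median is preferred over a rolling mean here because the spike itself would drag a mean upward and partially hide itself.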

# extracting the series for the IPhone article
IPhone <- data.matrix(train_1[c("IPhone_zh.wikipedia.org_all-access_spider"),])
dimnames(IPhone)<-NULL
IPhone<-array(IPhone)
time_index <- seq(from = as.POSIXct("2015-07-01"), to = as.POSIXct("2016-12-31"), by = "day")
IPhone_ts <- xts(IPhone, order.by =time_index ,frequency = 365.25)
autoplot(IPhone_ts)

Then we have data like this, i.e. the iPhone Wikipedia page. In this case the intervention actually changes the level, probably because Apple events usually occur in September of each year, which changes the pattern; Apple also holds events in early March, another contributing factor. Such a time series isn't stationary, since the interventions shift the level of the series itself.

We found one series that appears more or less stationary; let's have a look.

# extracting the series for the ASCII article
ASCII <- data.matrix(train_1[c("ASCII_zh.wikipedia.org_all-access_spider"),])
dimnames(ASCII)<-NULL
ASCII<-array(ASCII)
plot(ASCII,type='l')

The time series appears to have an increasing trend with no signs of seasonality.

Let's create a time series object and split the data into train and test. We will use data from 2015-07-01 to 2016-09-30 as the training set and data from 2016-10-01 to 2016-12-31 as the test set.

time_index <- seq(from = as.POSIXct("2015-07-01"), to = as.POSIXct("2016-12-31"), by = "day")
ASCII_ts <- xts(ASCII, order.by =time_index ,frequency = 365.25)
tsdisplay(ASCII_ts,ylab="ASCII daily traffic",xlab="Day")

ASCII_train<-ASCII_ts['2015-07-01/2016-09-30']
ASCII_test<-ASCII_ts['2016-10-01/2016-12-31']
tsdisplay(ASCII_train,ylab="ASCII daily traffic",xlab="Day")

tsdisplay(ASCII_test,ylab="ASCII daily traffic",xlab="Day")

length(ASCII_test)
## [1] 92

We can see from the ACF and PACF that the autocorrelations aren't dying down quickly, which means the series isn't stationary, and hence we will need differencing. Let's have a look at that as well.

Let's check whether we need a Box-Cox transformation to decouple the mean and variance.

BoxCox.lambda(ASCII_train)
## [1] 0.3897714

Yes, we need to perform a transformation with lambda = 0.3897714.
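For reference, the transform that `BoxCox()` applies to positive data is (y^λ − 1)/λ for λ ≠ 0 and log(y) at λ = 0. A minimal base-R version, just to make the formula concrete (not a replacement for `forecast::BoxCox`):

```r
# Box-Cox: (y^lambda - 1) / lambda for lambda != 0, log(y) when lambda = 0
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

box_cox(100, 0.3897714)   # the lambda estimated above compresses large values
box_cox(100, 1) - 99      # lambda = 1 just shifts the data: equals 0
```

Values of λ below 1 pull large observations toward the rest of the data, which stabilizes the variance of a series whose spread grows with its level.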

Let's also confirm our assumption that the data isn't stationary with a KPSS test.

kpss.test(ASCII_train)
## 
##  KPSS Test for Level Stationarity
## 
## data:  ASCII_train
## KPSS Level = 5.6063, Truncation lag parameter = 5, p-value = 0.01

Since the p-value is less than 0.05, we reject the KPSS null hypothesis of level stationarity: the data isn't stationary.

Let’s apply one level of differencing and check

ASCII_train_boxcox<-ASCII_train %>% BoxCox(lambda = BoxCox.lambda(ASCII_train))
ASCII_train_diff <- diff(ASCII_train_boxcox)
kpss.test(ASCII_train_diff)
## 
##  KPSS Test for Level Stationarity
## 
## data:  ASCII_train_diff
## KPSS Level = 0.017647, Truncation lag parameter = 5, p-value = 0.1
tsdisplay(ASCII_train_diff)

Naive Forecasts

The differenced data looks stationary now. Let's start with naive forecasts first, before we move on to ARIMA models.

#forecast horizon
h<-92
# converting the xts train/test slices to plain ts objects
ASCII_train_new<-ts(ASCII_ts['2015-07-01/2016-09-30'])
ASCII_test_new<-ts(ASCII_ts['2016-10-01/2016-12-31'])

#evaluating the models
Model_Mean <- meanf(ASCII_train_new, h) 
Model_Naive <- naive(ASCII_train_new, h) 
Model_Drift <- rwf(ASCII_train_new, h, drift=TRUE)

#Naive forecast
autoplot(ASCII_train_new) +
  autolayer(Model_Mean$mean, series="Mean") +
  autolayer(Model_Naive$mean, series="Naive") +
  autolayer(Model_Drift$mean, series="Drift") +
  ggtitle("Forecasts for daily ASCII Wikipedia Page") +
  xlab("Days") + ylab("ASCII traffic")

Let's have a look at the out-of-sample metrics on the test set.

accuracy(Model_Mean,ASCII_test)
##                        ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 1.288732e-15 9.247481 7.477975 -35.02098 58.12142 0.3481664
## Test set     5.076182e+00 9.681532 7.009018  11.46968 23.96929 0.3263323
##                   ACF1
## Training set 0.5328766
## Test set            NA
accuracy(Model_Naive,ASCII_test)
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 0.04157549 8.917198 6.956236 -12.684160 40.37525 0.3238748
## Test set     1.55434783 8.389305 6.228261  -3.046881 24.13922 0.2899810
##                    ACF1
## Training set -0.4634768
## Test set             NA
accuracy(Model_Drift,ASCII_test)
##                         ME     RMSE      MAE       MPE     MAPE      MASE
## Training set  4.791771e-16 8.917101 6.959238 -12.94458 40.45229 0.3240146
## Test set     -3.789126e-01 7.981986 6.161046 -10.65661 25.72335 0.2868516
##                    ACF1
## Training set -0.4634768
## Test set             NA

The metrics above show that drift performs best, because the trend is increasing; however, it isn't a stable model, since it assumes the series will keep increasing forever. Let's have a look at an ARIMA model.
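The drift forecast that wins here is simply the last observation plus h times the average historical change, ŷ(T+h) = y(T) + h·(y(T) − y(1))/(T − 1), which is exactly why it extrapolates the upward trend indefinitely. A base-R sketch of the three baselines (hand-rolled point-forecast equivalents of `meanf`, `naive`, and `rwf` with drift; function names are ours):

```r
# Hand-rolled point forecasts for the three baseline methods
mean_fc  <- function(y, h) rep(mean(y), h)          # meanf: historical mean
naive_fc <- function(y, h) rep(tail(y, 1), h)       # naive: last observation
drift_fc <- function(y, h) {                        # rwf with drift
  n <- length(y)
  slope <- (y[n] - y[1]) / (n - 1)   # average change per step
  y[n] + (1:h) * slope               # last value plus extrapolated trend
}

y <- c(10, 12, 11, 14, 16)
drift_fc(y, 3)   # 17.5 19.0 20.5
```

Note that the drift slope depends only on the first and last observations, so a single unusual endpoint can tilt every forecast.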

ARIMA Model

auto.arima(ASCII_train,seasonal = TRUE,lambda = 'auto')
## Series: ASCII_train 
## ARIMA(2,1,3) with drift 
## Box Cox transformation: lambda= 0.3897714 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2      ma3   drift
##       0.7449  -0.9047  -1.5888  1.4992  -0.8159  0.0073
## s.e.  0.0570   0.0682   0.0683  0.1202   0.0668  0.0042
## 
## sigma^2 estimated as 1.153:  log likelihood=-678.97
## AIC=1371.94   AICc=1372.19   BIC=1400.82
m1<-Arima(ASCII_train,lambda = 'auto',order=c(2,1,3),include.drift = TRUE)
checkresiduals(m1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,3) with drift
## Q* = 12.022, df = 4, p-value = 0.01719
## 
## Model df: 6.   Total lags used: 10

The Ljung-Box p-value (0.017) is below 0.05, so some autocorrelation remains in the residuals, though they otherwise appear stationary. Let's have a look at the metrics.

autoplot(forecast(m1,h=92))

accuracy(forecast(m1,h=92),ASCII_test)
##                      ME     RMSE      MAE        MPE     MAPE     MASE
## Training set  0.5616526 6.780914 5.236161  -8.861843 29.96078 0.243790
## Test set     -2.3523442 8.302841 6.700521 -18.718375 29.63149 0.311969
##                     ACF1
## Training set 0.001079399
## Test set              NA

Looking at the metrics, the ARIMA model has the lowest training RMSE and a lower test RMSE than the mean and naive baselines, though drift still has the lowest test RMSE.

EACF - ARIMA

Let's check whether there are better models using the extended autocorrelation function (EACF).

source("~/Desktop/UChicago/Quarters/03-Quarters/03-31006-TimeSeries/03-Week/eacf.r")
#differencing and box cox transforming the training data 
ASCII_train_new_boxcox<-ASCII_train_new %>% BoxCox(lambda = BoxCox.lambda(ASCII_train_new))
ASCII_train_new_diff <- diff(ASCII_train_new_boxcox)
eacf(ASCII_train_new_diff)
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o x x o o o x x o  o  o  o 
## 1 x x o x x o o o x o o  o  o  o 
## 2 x o o x x o o o x o o  x  o  o 
## 3 x x x x o o o o x o o  o  o  o 
## 4 x o x x x o o o x o x  o  o  o 
## 5 x x o o x x o o o o x  o  o  o 
## 6 x x o o x o o o o o x  o  o  o 
## 7 x o x x x o x x o o x  o  o  x

Note: we don't wish to overcomplicate the model, so we keep pmax and qmax at 2. Trying different candidate models from the matrix above: (p=0, q=1), (p=0, q=2), (p=1, q=2), (p=2, q=1), (p=2, q=2).

m2<-Arima(ASCII_train,lambda = 'auto',order=c(0,1,1),include.drift = TRUE)
m3<-Arima(ASCII_train,lambda = 'auto',order=c(0,1,2),include.drift = TRUE)
m4<-Arima(ASCII_train,lambda = 'auto',order=c(1,1,2),include.drift = TRUE)
m5<-Arima(ASCII_train,lambda = 'auto',order=c(2,1,1),include.drift = TRUE)
m6<-Arima(ASCII_train,lambda = 'auto',order=c(2,1,2),include.drift = TRUE)
cbind(m1$aicc,m2$aicc,m3$aicc,m4$aicc,m5$aicc,m6$aicc)
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
## [1,] 1372.194 1378.982 1376.102 1377.536 1377.471 1379.025

By AICc, m1 (the ARIMA(2,1,3) with drift) is the best, so we stick with the auto.arima model.
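As a sanity check, AICc is just AIC plus a small-sample penalty: AICc = AIC + 2k(k+1)/(n − k − 1), where k counts all estimated parameters (here 2 AR + 3 MA + drift + the variance, i.e. 7) and n is the number of observations used after differencing (457 by our count, which is an assumption about the fitting internals, not something reported above):

```r
# Small-sample corrected AIC: adds 2k(k+1)/(n - k - 1) to AIC
aicc <- function(aic, k, n) aic + 2 * k * (k + 1) / (n - k - 1)

# ARIMA(2,1,3) with drift: AIC = 1371.94, k = 7, n = 457
aicc(1371.94, 7, 457)   # ~1372.19, matching the AICc reported by auto.arima
```

The correction shrinks toward zero as n grows, which is why AIC and AICc barely differ on a series this long.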

Holt's Method

m_hw <- holt(ASCII_train,h=92) 
autoplot(m_hw)

m_hw_damp <- holt(ASCII_train,h=92,damped = TRUE) 
autoplot(m_hw_damp)

accuracy(m_hw)
##                        ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.005428262 6.854588 5.318492 -12.91236 31.87033 0.2476232
##                    ACF1
## Training set 0.07815376
accuracy(m_hw_damp)
##                     ME     RMSE      MAE       MPE    MAPE      MASE       ACF1
## Training set 0.3755877 6.831301 5.282292 -11.99129 31.6822 0.2459377 0.07318879
checkresiduals(m_hw)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt's method
## Q* = 17.04, df = 6, p-value = 0.009136
## 
## Model df: 4.   Total lags used: 10
checkresiduals(m_hw_damp)

## 
##  Ljung-Box test
## 
## data:  Residuals from Damped Holt's method
## Q* = 17.039, df = 5, p-value = 0.004426
## 
## Model df: 5.   Total lags used: 10

The Ljung-Box tests tell us that the residuals are correlated: the p-values are below 0.05, so we reject the null hypothesis of independently distributed residuals, which means these models aren't a good fit.
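The same test is available in base R as `stats::Box.test`; the degrees of freedom reported by `checkresiduals` are lag − fitdf, which is why 10 lags minus 4 model df gave df = 6 for Holt and 10 − 5 = 5 for the damped variant. A quick illustration on simulated stand-in residuals:

```r
# Ljung-Box via base R; fitdf = number of parameters the model estimated
set.seed(1)
resid <- rnorm(200)   # stand-in residuals; real ones come from the fitted model
bt <- Box.test(resid, lag = 10, type = "Ljung-Box", fitdf = 4)
bt$parameter          # df = lag - fitdf = 10 - 4 = 6
bt$p.value            # small values indicate residual autocorrelation
```

Subtracting fitdf accounts for the parameters the model already used to soak up autocorrelation, so the test isn't unfairly easy to pass.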

Cross Validation

Let's use the best model found, i.e. ARIMA(2,1,3) with drift.

m7 <- function(x, h){forecast(Arima(x, order=c(2,1,3),lambda = 'auto',include.drift=TRUE), h=h)}
error_1 <- tsCV(ASCII_train, m7, h=1)
error_2 <- tsCV(ASCII_train, m7, h=1, window = 12) # Rolling/Sliding Window

Let's have a look at the errors

autoplot(error_1, series = 'Expanding Window') +
  autolayer(error_2, series = 'Rolling Window') 

Let’s have a look at the values as well

print(sqrt(mean(error_1^2, na.rm=TRUE))) 
## [1] 7.229139
print(sqrt(mean(error_2^2, na.rm=TRUE))) 
## [1] 12.39619
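The rolling window doing worse here makes sense: with window = 12 each refit sees only 12 points of history. To make the expanding vs. rolling distinction concrete, here is a hand-rolled one-step-ahead CV in base R with a simple mean forecaster (function names are ours, not from the forecast package):

```r
# One-step-ahead CV errors. Expanding window: all history up to t.
# Rolling window: only the most recent `window` points.
cv_errors <- function(y, fc, window = NULL) {
  n <- length(y)
  sapply(1:(n - 1), function(t) {
    if (!is.null(window) && t < window) return(NA_real_)  # not enough history
    start <- if (is.null(window)) 1 else t - window + 1
    y[t + 1] - fc(y[start:t])          # forecast error at t + 1
  })
}

mean_fc1 <- function(x) mean(x)        # one-step-ahead mean forecast
y <- c(5, 7, 6, 9, 8, 11)
cv_errors(y, mean_fc1)                 # expanding window errors
cv_errors(y, mean_fc1, window = 3)     # rolling window: first 2 entries are NA
```

`tsCV` does the same bookkeeping for an arbitrary forecast function and horizon; the NA padding at the start of the rolling-window errors is why we pass `na.rm=TRUE` when computing RMSE above.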

Prophet

time_index <- seq(from = as.POSIXct("2015-07-01"), to = as.POSIXct("2016-09-30"), by = "day")
df<-cbind(as.data.frame(time_index),y=ASCII_ts['2015-07-01/2016-09-30'])
colnames(df)<-c('ds','y')
m <- prophet(df)
future <- make_future_dataframe(m, periods = 92)
fcst <- predict(m, future)   # avoid masking forecast::forecast()
plot(m, fcst)

# the full series again, for reference
autoplot(ASCII_ts)